GLIMPSE(l) UNIX System V (October 11, 1995) GLIMPSE(l)
NAME
glimpse 3.0 - search quickly through entire file systems
OVERVIEW
Glimpse (which stands for GLobal IMPlicit SEarch) is an
indexing and query system that allows you to search through
all your files very quickly. Glimpse supports most of
agrep's options (agrep is our powerful version of grep)
including approximate matching (e.g., finding misspelled
words), Boolean queries, and even some limited forms of
regular expressions. It is used in the same way, except that
you don't have to specify file names. So, if you are
looking for a needle anywhere in your file system, all you
have to do is say glimpse needle and all lines containing
needle will appear preceded by the file name.
To use glimpse you first need to index your files with
glimpseindex, which is typically run every night.
glimpseindex -o ~ will index everything at or below your
home directory. See man glimpseindex for more details.
Glimpse is also available for HTTP servers, to provide
search of local data, as a set of tools called GlimpseHTTP.
See http://glimpse.cs.arizona.edu:1994/ghttp/ for more
information.
Glimpse includes all of agrep and can be used instead of
agrep by giving one or more file names at the end of the
command. This will cause glimpse to ignore the index and run
agrep as usual. For example, glimpse -1 pattern file is the
same as agrep -1 pattern file. We added a new option to
agrep: -r recursively searches the directory and everything
below it (see agrep options below); it is used only when
glimpse reverts to agrep.
Mail glimpse-request@cs.arizona.edu to be added to the
glimpse mailing list. Mail glimpse@cs.arizona.edu to report
bugs, ask questions, discuss tricks for using glimpse, etc.
(this is a moderated mailing list with very little traffic,
mostly announcements). An HTML version of these manual pages
can be found at
http://glimpse.cs.arizona.edu:1994/glimpsehelp.html
See also the glimpse developers' home page at
http://glimpse.cs.arizona.edu:1994/
SYNOPSIS
glimpse [ -(agrep's options) -C -F file_pattern -H directory
-J host_name -K port_number -L x -N -T directory -V -W -z ]
pattern
INTRODUCTION
We start with simple ways to use glimpse and describe all
the options in detail later on. Once an index is built,
using glimpseindex, searching for pattern is as easy as
saying
glimpse pattern
The output of glimpse is similar to that of agrep (or any
other grep), except that the name of the file containing the
match appears at the beginning of the line by default. The
pattern can be any legal agrep pattern, including a regular
expression or a Boolean query (e.g., searching for Tucson
AND Arizona is done by glimpse 'Tucson;Arizona').
The speed of glimpse depends mainly on the number and sizes
of the files that contain a match and only to a second
degree on the total size of all indexed files. If the
pattern is reasonably uncommon, then all matches will be
reported in a few seconds even if the indexed files total
500MB or more. Some information on how glimpse works and a
reference to a detailed article are given below.
Most of agrep's (and other greps') options are supported,
including approximate matching. For example,
glimpse -1 'Tuson;Arezona'
will output all lines containing both patterns allowing one
spelling error in any of the patterns (either insertion,
deletion, or substitution), which in this case is definitely
needed.
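The notion of "one spelling error" above can be illustrated with a small Python sketch of unit-cost edit distance, where each insertion, deletion, or substitution counts as one error. This is only an illustration and not the algorithm agrep uses: the function name edit_distance is hypothetical, and the sketch compares two whole words, whereas agrep looks for approximate occurrences of the pattern inside each record.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tuson", "Tucson"))     # one missing letter
print(edit_distance("Arezona", "Arizona"))  # one wrong letter
```

Both calls print 1, which is why -1 is enough for the misspelled query to find 'Tucson' and 'Arizona'.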
glimpse -w -i 'parent'
specifies case insensitive (-i) and match on complete words
(-w). So 'Parent' and 'PARENT' will match, 'parent/child'
will match, but 'parenthesis' or 'parents' will not match.
(Starting at version 3.0, glimpse can be much faster when
these two options are specified, especially for very large
indexes. You may want to set an alias especially for
"glimpse -w -i".)
The -F option provides a pattern that must match the file
name. For example,
glimpse -F '\.c$' needle
will find the pattern needle in all files whose name ends
with .c. (Glimpse will first check its index to determine
which files may contain the pattern and then run agrep on
the file names to further limit the search.) The -F option
should not be put at the end after the main pattern (e.g.,
"glimpse needle -F hay" is incorrect).
DETAILED DESCRIPTION OF GLIMPSE
The use of glimpse is similar to that of agrep (or any other
grep), except that there is no need to specify file names.
Most of agrep's (and other greps') options are supported. It
is important to keep in mind that the search is over many
files. Using very common patterns may lead to a huge number
of matches. Running glimpse a will work, but will take a
long time and will probably output all of the indexed files.
We start with the new options, and then list all of agrep's
original options (with some additional comments when
relevant).
The New Options of Glimpse
-a prints attribute names. This option applies only to
structured data (used with glimpseindex -s); this
option was added to support the Harvest project. See
STRUCTURED QUERIES below for more information and also
http://harvest.cs.colorado.edu for more information
about the Harvest project.
-C tells glimpse to send its queries to glimpseserver.
See man glimpseserver for more details.
-E prints the lines in the index (as they appear in the
index) which match the pattern. Used mostly for
debugging and maintenance of the index.
-F file_pattern
limits the search to those files whose name (including
the whole path) matches file_pattern. If file_pattern
matches a directory, then all files with this directory
on their path will be considered. To limit the search
to actual file names, use $ at the end of the pattern.
file_pattern can be a regular expression and even a
Boolean pattern. (Glimpse simply runs agrep
file_pattern on the list of file names obtained from
the index to filter the list.) For example,
glimpse -F 'src#\.c$' needle
will search for needle in all .c files with src
somewhere along the path. The -F file_pattern must
appear before the search pattern (e.g., glimpse needle
-F '\.c$' will not work). It is possible to use some
of agrep's options when matching file names. In this
case all options as well as the file_pattern should be
in quotes. (-B and -v do not work very well as part of
a file_pattern.) For example,
glimpse -F '-1 gopherc' pattern
will allow one spelling error when matching gopherc to
the file names (so "gopherrc" and "gopher" will be
considered as well).
glimpse -F '-v \.c$' counter
will search for 'counter' in all files except for .c
files.
-H directory_name
searches for the index and the other .glimpse files in
directory_name. The default is the home directory.
This option is useful, for example, if several
different indexes are maintained for different archives
(e.g., one for mail messages, one for source code, one
for articles).
-J host_name
used in conjunction with glimpseserver (-C) to connect
to one particular server. See man glimpseserver for
more details.
-K port_number
used in conjunction with glimpseserver (-C) to connect
to one particular server at the specified TCP port
number. See man glimpseserver for more details.
-L x | x:y | x:y:z
if one number is given, it is a limit on the total
number of matches. Glimpse outputs only the first x
matches. If -l is used (i.e., only file names are
sought), then the limit is on the number of files;
otherwise, the limit is on the number of records. If
two numbers are given (x:y), then y is an added limit
on the total number of files. If three numbers are
given (x:y:z), then z is an added limit on the number
of matches per file. If any of the x, y, or z is set
to 0, it means to ignore it (in other words 0 =
infinity in this case); for example, -L 0:10 will
output all matches to the first 10 files that contain a
match.
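As a reading aid, the x:y:z argument described above can be modeled with the following Python sketch. It is not taken from the glimpse sources: the names Limits and parse_limit are hypothetical, and None is used here to stand for "no limit" (a 0 or an omitted field).

```python
from typing import NamedTuple, Optional


class Limits(NamedTuple):
    matches: Optional[int]   # x: total matches (files if -l is used, else records)
    files: Optional[int]     # y: total number of files
    per_file: Optional[int]  # z: matches per file


def parse_limit(spec: str) -> Limits:
    """Parse an 'x', 'x:y' or 'x:y:z' limit; 0 or a missing field means no limit."""
    parts = spec.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected x, x:y or x:y:z, got {spec!r}")
    values = []
    for part in parts:
        n = int(part)
        if n < 0:
            raise ValueError(f"limits must be non-negative, got {n}")
        values.append(None if n == 0 else n)
    # Pad the omitted trailing fields with "no limit".
    values += [None] * (3 - len(values))
    return Limits(*values)


print(parse_limit("0:10"))
```

For -L 0:10 the sketch yields no limit on matches, a limit of 10 files, and no per-file limit, which corresponds to "all matches to the first 10 files that contain a match".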
-N searches only the index (so the search is faster). If
-o or -b are used then the result is the number of
files that have a potential match plus a prompt to ask
if you want to see the file names. (If -y is used,
then there is no prompt and the names of the files will
be shown.) This could be a way to get the matching
file names without even having access to the files
themselves. However, because only the index is
searched, some potential matches may not be real
matches. In other words, with -N you will not miss any
file but you may get extra files. For example, since
the index stores everything in lower case, a case-
sensitive query may match a file that has only a case-
insensitive match. Boolean queries may match a file
that has all the keywords but not in the same line
(indexing with -b allows glimpse to figure out whether
the keywords are close, but it cannot figure out from
the index whether they are exactly on the same line or
in the same record without looking at the file). If
the index was not built with -o or -b, then this option
outputs the number of blocks matching the pattern. This
is useful as an indication of how long the search will
take. All files are partitioned into usually 200-250
blocks. The file .glimpse_statistics contains the
total number of blocks (or glimpse -N a will give a
pretty good estimate; only blocks with no occurrences
of 'a' will be missed).
-Q an extension to -N that not only displays the filename
where the match occurs, but the exact occurrences
(offsets) as seen in the index.
-T directory
Use directory as a place where temporary files are
built. (Glimpse produces some small temporary files
usually in /tmp.) This option is useful mainly in the
context of structured queries for the Harvest project,
where the temporary files may be non-trivial.
-V prints the current version of glimpse.
-W The default for Boolean AND queries is that they cover
one record (the default for a record is one line) at a
time. For example, glimpse 'good;bad' will output all
lines containing both 'good' and 'bad'. The -W option
changes the scope of Booleans to be the whole file.
Within a file glimpse will output all matches to any of
the patterns. So, glimpse -W 'good;bad' will output
all lines containing 'good' or 'bad', but only in files
that contain both patterns. For structured queries,
the scope is always the whole attribute or file.
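The difference between the default record scope and the -W file scope can be sketched in Python as follows. This is a simplified model under stated assumptions, not glimpse's implementation: the function boolean_and and the in-memory files dictionary are hypothetical, records are plain lines, and patterns are plain substrings.

```python
from typing import Dict, List, Tuple


def boolean_and(files: Dict[str, List[str]], terms: List[str],
                whole_file: bool = False) -> List[Tuple[str, str]]:
    """Return (file name, line) pairs for an AND query over plain substrings."""
    hits = []
    for name, lines in files.items():
        if whole_file:
            # -W behaviour: the file as a whole must contain every term,
            # and then every line containing any of the terms is output.
            text = "\n".join(lines)
            if all(t in text for t in terms):
                hits += [(name, ln) for ln in lines
                         if any(t in ln for t in terms)]
        else:
            # Default behaviour: every term must appear in the same line.
            hits += [(name, ln) for ln in lines
                     if all(t in ln for t in terms)]
    return hits


files = {
    "a.txt": ["good and bad", "only good here", "nothing"],
    "b.txt": ["good alone"],
}
print(boolean_and(files, ["good", "bad"]))
print(boolean_and(files, ["good", "bad"], whole_file=True))
```

In this model b.txt never appears with whole_file=True because it lacks 'bad', while a.txt contributes both of its lines that mention either word.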
-z Allow customizable filtering, using the file
.glimpse_filters to run the programs listed there
for each match. The best example is
compress/decompress. If .glimpse_filters includes the
line
*.Z uncompress <
(separated by tabs) then before indexing any file that
matches the pattern "*.Z" (same syntax as the one for
.glimpse_exclude) the command listed is executed first
(assuming input is from stdin, which is why uncompress
needs <) and its output (assuming it goes to stdout) is
indexed. The file itself is not changed (i.e., it
stays compressed). Then if glimpse -z is used, the
same program is used on these files on the fly. Any
program can be used (we run 'exec'). For example, one
can filter out parts of files that should not be
indexed. Glimpseindex tries to apply all filters in
.glimpse_filters in the order they are given. For
example, if you want to uncompress a file and then
extract some part of it, put the compression command
(the example above) first and then another line that
specifies the extraction. Note that this can slow down
the search because the filters need to be run before
files are searched. (See also glimpseindex.)
The Options of Agrep Supported by Glimpse
-# # is an integer between 1 and 8 specifying the maximum
number of errors permitted in finding the approximate
matches (the default is zero). Generally, each
insertion, deletion, or substitution counts as one
error. It is possible to adjust the relative cost of
insertions, deletions and substitutions (see -I -D and
-S options). Since the index stores only lower case
characters, errors of substituting upper case with
lower case may be missed (see LIMITATIONS).
-c Display only the count of matching records. Only files
with count > 0 are displayed.
-d 'delim'
Define delim to be the separator between two records.
The default value is '$', namely a record is by default
a line. delim can be a string of size at most 8 (with
possible use of ^ and $), but not a regular expression.
Text between two delims, before the first delim, and
after the last delim is considered as one record. For
example, -d '$$' defines paragraphs as records and -d
'^From ' defines mail messages as records. glimpse
matches each record separately. This option does not
currently work with regular expressions. The -d option
is especially useful for Boolean AND queries, because
the patterns need not appear in the same line but in
the same record. For example, glimpse -F mail -d
'^From ' 'glimpse;arizona;announcement' will output all
mail messages (in their entirety) that have the 3
patterns anywhere in the message (or the header),
assuming that files with 'mail' in their name contain
mail messages. If you want to output a whole file that
matches a Boolean pattern, you can use -d 'O9g1Xs' (or
another garbage pattern). If the delimiter doesn't
appear anywhere, the whole file is one record (there is
a limit, however, to the size of records, see
LIMITATIONS). Glimpse warning: Use this option with
care. If the delimiter is set to match mail messages,
for example, and glimpse finds the pattern in a regular
file, it may not find the delimiter and will therefore
output the whole file. (The -t option - see below -
can be used to put the delim at the end of the record.)
-e pattern
Same as a simple pattern argument, but useful when the
pattern begins with a `-'.
-h Do not display filenames.
-i Case-insensitive search - e.g., "A" and "a" are
considered equivalent. Glimpse's index stores all
patterns in lower case (see LIMITATIONS below).
-k No symbol in the pattern is treated as a meta
character. For example, glimpse -k 'a(b|c)*d' will find
the occurrences of a(b|c)*d whereas glimpse 'a(b|c)*d'
will find substrings that match the regular expression
'a(b|c)*d'. (The only exception is ^ at the beginning
of the pattern and $ at the end of the pattern, which
are still interpreted in the usual way. Use \^ or \$ if
you need them verbatim.)
-l Output only the names of files that contain a match.
-n Each matching record (line) is prefixed by its record
(line) number in the file.
-r (This option is valid only when a file name is given
and glimpse is used as agrep; it is a new agrep
option.) If the file name is a directory name, glimpse
will search (recursively) the whole directory and
everything below it. Glimpse will not use its index.
-s Work silently, that is, display nothing except error
messages. This is useful for checking the error
status.
-t Output the record starting from the end of delim to
(and including) the next delim. This is useful for
cases where delim should come at the end of the record.
(See warning for the -d option.)
-w Search for the pattern as a word - i.e., surrounded by
non-alphanumeric characters. For example, glimpse -w
-1 car will match cars, but not characters and not
car10. The non-alphanumeric characters must surround the match;
they cannot be counted as errors. This option does not
work with regular expressions.
-x The pattern must match the whole line. (This option is
translated to -w when the index is searched and it is
used only when the actual text is searched. It is of
limited use in glimpse.)
-y Do not prompt. Proceed with the match as if the answer
to any prompt is y.
-B Best match mode. (Warning: -B sometimes misses
matches. It is safer to specify the number of errors
explicitly.) When -B is specified and no exact matches
are found, glimpse will continue to search until the
closest matches (i.e., the ones with minimum number of
errors) are found, at which point the following message
will be shown: "the best match contains x errors,
there are y matches, output them? (y/n)" This message
refers to the number of matches found in the index.
There may be many more matches in the actual text (or
there may be none if -F is used to filter files). When
the -#, -c, or -l options are specified, the -B option
is ignored. In general, -B may be slower than -#, but
not by very much. Since the index stores only lower
case characters, errors of substituting upper case with
lower case may be missed (see LIMITATIONS).
-Dk Set the cost of a deletion to k (k is a positive
integer). This option does not currently work with
regular expressions.
-G Output the (whole) files that contain a match.
-Ik Set the cost of an insertion to k (k is a positive
integer). This option does not currently work with
regular expressions.
-Sk Set the cost of a substitution to k (k is a positive
integer). This option does not currently work with
regular expressions.
The characters `$', `^', `*', `[', `]', `|', `(', `)',
`!', and `\' can cause unexpected results when included in
the pattern, as these characters are also meaningful to the
shell. To avoid these problems, enclose the entire pattern
in single quotes, i.e., 'pattern'. Do not use double quotes
(").
PATTERNS
glimpse supports a large variety of patterns, including
simple strings, strings with classes of characters, sets of
strings, wild cards, and regular expressions (see
LIMITATIONS).
Strings
Strings are any sequence of characters, including the
special symbols `^' for beginning of line and `$' for
end of line. The following special characters ( `$',
`^', `*', `[', `|', `(', `)', `!', and `\' ) as
well as the following meta characters special to
glimpse (and agrep): `;', `,', `#', `<', `>', `-', and
`.', should be preceded by `\' if they are to be
matched as regular characters. For example, \^abc\\
corresponds to the string ^abc\, whereas ^abc
corresponds to the string abc at the beginning of a
line.
Classes of characters
A list of characters inside [] (in order) corresponds
to any character from the list. For example, [a-ho-z]
is any character between a and h or between o and z.
The symbol `^' inside [] complements the list. For
example, [^i-n] denotes any character in the character
set except the characters 'i' to 'n'. The symbol `^' thus
has two meanings, but this is consistent with egrep.
The symbol `.' (don't care) stands for any symbol
(except for the newline symbol).
Boolean operations
Glimpse supports an `AND' operation denoted by the
symbol `;', an `OR' operation denoted by the symbol `,',
or any combination. For example, glimpse
'pizza;cheeseburger' will output all lines containing
both patterns. glimpse -F 'gnu;\.c$' 'define;DEFAULT'
will output all lines containing both 'define' and
'DEFAULT' (anywhere in the line, not necessarily in
order) in files whose name contains 'gnu' and ends with
.c. glimpse '{political,computer};science' will match
'political science' or 'science of computers'.
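The way `;' and `,' combine in the examples above can be approximated with a short Python sketch. It is a deliberately restricted model, not agrep's pattern parser: the function name matches is hypothetical, it handles only an AND of terms where each term is either a plain string or one brace group of comma-separated alternatives, and it treats every alternative as a plain substring.

```python
def matches(pattern: str, line: str) -> bool:
    """Evaluate 'a;{b,c}' style patterns against a single line (record)."""
    for term in pattern.split(";"):
        # A term is either a plain string or a {x,y,...} group of alternatives.
        alternatives = term.strip("{}").split(",")
        if not any(alt in line for alt in alternatives):
            return False  # one AND term failed, so the whole query fails
    return True


for text in ("political science", "science of computers", "computer games"):
    print(text, "->", matches("{political,computer};science", text))
```

The first two lines satisfy both AND terms; the third contains 'computer' but not 'science', so it is rejected.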
Wild cards
The symbol '#' is used to denote a sequence of any
number (including 0) of arbitrary characters (see
LIMITATIONS). The symbol # is equivalent to .* in
egrep. In fact, .* will work too, because it is a
valid regular expression (see below), but unless this
is part of an actual regular expression, # will work
faster. (Currently glimpse is experiencing some
problems with #.)
Combination of exact and approximate matching
Any pattern inside angle brackets <> must match the
text exactly even if the match is with errors. For
example, <mathemat>ics matches mathematical with one
error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter how
many errors are allowed. (This option is buggy at the
moment.)
Regular expressions
Since the index is word based, a regular expression
must match words that appear in the index for glimpse
to find it. Glimpse first strips all non-alphabetic
characters from the regular expression, and
searches the index for all remaining words. It then
applies the regular expression matching algorithm to
the files found in the index. For example, glimpse
'abc.*xyz' will search the index for all files that
contain both 'abc' and 'xyz', and then search directly
for 'abc.*xyz' in those files. (If you use glimpse -w
'abc.*xyz', then 'abcxyz' will not be found, because
glimpse will think that abc and xyz need to match
whole words.) The syntax of regular expressions in
glimpse is in general the same as that for agrep. The
union operation `|', Kleene closure `*', and
parentheses () are all supported. Currently '+' is not
supported. Regular expressions are currently limited
to approximately 30 characters (generally excluding
meta characters). Some options (-d, -w, -t, -x, -D,
-I, -S) do not currently work with regular expressions.
The maximal number of errors for regular expressions
that use '*' or '|' is 4. (See LIMITATIONS.)
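The two-phase search described above (index lookup on the alphabetic words, then a real regular expression match on the candidate files) can be sketched in Python. This is an illustrative model only: build_index and search are hypothetical names, the "index" is a per-file set of lower-case alphabetic words, the index lookup is modeled as a substring test (which is why 'abcxyz' is still found without -w), and Python's re syntax stands in for agrep's.

```python
import re
from typing import Dict, List, Set, Tuple


def build_index(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Toy word index: the set of lower-case alphabetic words in each file."""
    return {name: set(re.findall(r"[a-z]+", text.lower()))
            for name, text in files.items()}


def search(files: Dict[str, str], pattern: str) -> List[Tuple[str, str]]:
    # Phase 1: strip non-alphabetic characters and look the words up in the index.
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", pattern)]
    index = build_index(files)
    candidates = [name for name, vocab in index.items()
                  if all(any(w in v for v in vocab) for w in words)]
    # Phase 2: run the full regular expression only on the candidate files.
    regex = re.compile(pattern)
    return [(name, line) for name in candidates
            for line in files[name].splitlines() if regex.search(line)]


files = {
    "one.txt": "abc then xyz\nxyz only",
    "two.txt": "xyz before abc",
    "three.txt": "abc only",
    "four.txt": "abcxyz",
}
print(search(files, "abc.*xyz"))
```

Here three.txt is dropped in the first phase because its words contain no 'xyz', while two.txt passes the index lookup but is rejected in the second phase because 'abc' does not precede 'xyz' on any line.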
Structured queries
Glimpse supports some form of structured queries using
Harvest's SOIF format. See STRUCTURED QUERIES below
for details.
EXAMPLES
(Run "glimpse '^glimpse' this-file" to get a list of all
examples, some of which were given earlier.)
glimpse -F 'haystack.h$' needle
finds all needles in all haystack.h files.
glimpse -2 -F html Anestesiology
outputs all occurrences of Anestesiology with up to two
errors in files with html somewhere in their full name.
glimpse -l -F '.c$' variablename
lists the names of all .c files that contain
variablename (the -l option lists file names rather
than outputting the matched lines).
glimpse -F 'mail;1993' 'windsurfing;Arizona'
finds all lines containing windsurfing and Arizona in
all files having `mail' and '1993' somewhere in their
full name.
glimpse -F mail 't.j@#uk'
finds all mail addresses (search only files with mail
somewhere in their name) from the uk, where the login
name ends with t.j, where the . stands for any one
character. (This is very useful to find a login name of
someone whose middle name you don't know.)
glimpse -F mbox -h -G . > MBOX
concatenates all files whose name matches `mbox' into
one big one.
SEARCHING IN COMPRESSED FILES
Glimpse includes an optional new compression program, called
cast, which allows glimpse (and agrep) to search the
compressed files without having to decompress them. The
search is actually significantly faster when the files are
compressed. However, we have not tested cast as thoroughly
as we would have liked, and a mishap in a compression
algorithm can cause loss of data, so we recommend using
cast very carefully at this point. (Unless you specifically
use cast, the default is to ignore it.)
GLIMPSEINDEX FILES
All files used by glimpse are located at the directory(ies)
where the index(es) is (are) stored and have .glimpse_ as a
prefix. The first two files (.glimpse_exclude and
.glimpse_include) are optionally supplied by the user. The
other files are built and read by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is
explicitly told to ignore. In general, the syntax of
.glimpse_exclude/include is the same as that of agrep
(or any other grep). The lines in the .glimpse_exclude
file are matched to the file names, and if they match,
the files are excluded. Notice that agrep matches
parts of the string; e.g., agrep /ftp/pub will match
/home/ftp/pub and /ftp/pub/whatever. So, if you want
to exclude /ftp/pub/core, you just list it, as is, in
the .glimpse_exclude file. If you put
"/home/ftp/pub/cdrom" in .glimpse_exclude, every file
name that matches that string will be excluded, meaning
all files below it. You can use ^ to indicate the
beginning of a file name, and $ to indicate the end of
one, and you can use * and ? in the usual way. For
example /ftp/*html will exclude /ftp/pub/foo.html, but
will also exclude /home/ftp/pub/html/whatever; if you
want to exclude files that start with /ftp and end with
html, use ^/ftp*html$. Notice that putting a * at the
beginning or at the end is redundant (in fact, in this
case glimpseindex will remove the * when it does the
indexing). No other meta characters are allowed in
.glimpse_exclude (e.g., don't use .* or # or |). Lines
with * or ? must have no more than 30 characters.
Notice that, although the index itself will not be
indexed, the list of file names (.glimpse_filenames)
will be indexed unless it is explicitly listed in
.glimpse_exclude.
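The matching rules for .glimpse_exclude lines described above (substring match by default, ^ and $ as anchors, * and ? as wild cards) can be modeled with the following Python sketch. It is an approximation for illustration, not glimpseindex's code: exclude_regex and is_excluded are hypothetical names, * is assumed to match any sequence of characters including '/', and ? is assumed to match any single character.

```python
import re


def exclude_regex(entry: str):
    """Translate one .glimpse_exclude line into a Python regular expression."""
    anchored_start = entry.startswith("^")
    anchored_end = entry.endswith("$")
    start = 1 if anchored_start else 0
    end = len(entry) - 1 if anchored_end else len(entry)
    body = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c)
        for c in entry[start:end]
    )
    return re.compile(("^" if anchored_start else "") + body +
                      ("$" if anchored_end else ""))


def is_excluded(entry: str, path: str) -> bool:
    # search() rather than fullmatch(): an unanchored entry matches
    # any part of the file name, as the text above points out.
    return exclude_regex(entry).search(path) is not None


print(is_excluded("/ftp/*html", "/home/ftp/pub/html/whatever"))
print(is_excluded("^/ftp*html$", "/home/ftp/pub/html/whatever"))
```

Under these assumptions the unanchored entry also excludes the path under /home, while the anchored form excludes only names that start with /ftp and end with html.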
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex is
explicitly told to include in the index even though
they may look like non-text files. Symbolic links are
followed by glimpseindex only if they are specifically
included here. If a file is in both .glimpse_exclude
and .glimpse_include it will be excluded.
.glimpse_filenames
contains the list of all indexed file names, one per
line. This is an ASCII file that can also be used with
agrep to search for a file name leading to a fast find
command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files that
have 'count' in their name (including anywhere on the
path from the index). Setting the following alias in
the .login file may be useful:
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
.glimpse_index
contains the index. The index consists of lines, each
starting with a word followed by a list of block
numbers (unless the -o or -b options are used, in which
case each word is followed by an offset into the file
.glimpse_partitions where all pointers are kept). The
block/file numbers are stored in binary form, so this
is not an ASCII file.
.glimpse_messages
contains the output of the -w option of glimpseindex (see
man glimpseindex).
.glimpse_partitions
contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options,
some part of the index. This file is used internally
by glimpse and it is a non-ASCII file.
.glimpse_statistics
contains some statistics about the makeup of the index.
Useful for some advanced applications and customization
of glimpse.
.glimpse_turbo
An added data structure (used under glimpseindex -o or
-b only) that helps to speed up queries significantly
for large indexes. Its size is 0.25MB. Glimpse will
work without it if needed.
STRUCTURED QUERIES
Glimpse can search for Boolean combinations of
"attribute=value" terms by using the Harvest SOIF parser
library (in glimpse/libtemplate). To search this way, the
index must be made by using the -s option of glimpseindex
(this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize
"structured" files, they must be in SOIF format. In this
format, each value is prefixed by an attribute-name with the
size of the value (in bytes) present in "{}" after the name
of the attribute. For example, the following lines are part
of an SOIF file:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name. The query glimpse
"pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing". The file
itself is considered to be one "object" and its name/url
appears as the first attribute with an "@" prefix; e.g.,
@FILE { http://xxx... }. The scope of Boolean operations
changes from records (lines) to whole files when structured
queries are used in glimpse (since individual query terms
can look at different attributes and they may not be
"covered" by the record/line). Note that glimpse can only
search for patterns in the value parts of the SOIF file:
there are some attributes (like the TTL, MD5, etc.) that are
interpreted by Harvest's internal routines. See
http://harvest.cs.colorado.edu/harvest/user-manual/ for more
detailed information on the SOIF format.
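The attribute-line layout described above can be checked mechanically.
The following is a minimal Python sketch, not part of glimpse or Harvest.
The name parse_soif_line is hypothetical, and the sketch assumes that each
value fits on one line and is separated from the colon by spaces or tabs.

```python
import re

# Hypothetical helper for illustration; not part of glimpse or Harvest.
# Matches the "name{size}:" prefix of one attribute line.
_ATTR_PREFIX = re.compile(r"([^{\s]+)\{(\d+)\}:[ \t]*")


def parse_soif_line(line):
    """Split one 'name{size}: value' line into (name, value).

    Assumes the value fits on a single line. Raises ValueError when the
    line has no attribute prefix or the declared byte size is wrong.
    """
    match = _ATTR_PREFIX.match(line)
    if match is None:
        raise ValueError("not an attribute line: %r" % (line,))
    name, size = match.group(1), int(match.group(2))
    value = line[match.end():].rstrip("\n")
    if len(value.encode("utf-8")) != size:
        raise ValueError("size mismatch for attribute %r" % (name,))
    return name, value


print(parse_soif_line("type{17}: Directory-Listing"))
print(parse_soif_line("md5{32}: 3858c73d68616df0ed58a44d306b12ba"))
```

Both sample lines from the text pass the size check: "Directory-Listing"
is 17 bytes long and the md5 value is 32 bytes long.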
REFERENCES
1. U. Manber and S. Wu, "GLIMPSE: A Tool to Search Through
Entire File Systems," Usenix Winter 1994 Technical
Conference, San Francisco (January 1994), pp. 23-32.
Also, Technical Report #TR 93-34, Dept. of Computer
Science, University of Arizona, October 1993 (a
postscript file is available by anonymous ftp at
cs.arizona.edu:reports/1993/TR93-34.ps).
2. S. Wu and U. Manber, "Fast Text Searching Allowing
Errors," Communications of the ACM 35 (October 1992),
pp. 83-91.
SEE ALSO
agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1),
grep(1), sh(1), csh(1).
LIMITATIONS
The index of glimpse is word based. A pattern that contains
more than one word cannot be found in the index. The way
glimpse overcomes this weakness is by splitting any
multi-word pattern into its set of words and looking for all of
them in the index. For example, glimpse 'linear
programming' will first consult the index to find all files
containing both linear and programming, and then apply agrep
to find the combined pattern. This is usually an effective
solution, but it can be slow for cases where both words are
very common, but their combination is not.
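The two-stage lookup described above can be illustrated with a toy model.
This is a minimal Python sketch under stated assumptions, not glimpse's
implementation: the index is a plain dictionary from words to file names
(the real index points to blocks), a substring test stands in for agrep,
and the name two_stage_search is hypothetical.

```python
def two_stage_search(pattern, index, files):
    """Return sorted names of files that contain the whole pattern.

    index: dict mapping a lower-case word to the set of file names
           containing it (a toy stand-in for the glimpse index).
    files: dict mapping a file name to its text.
    """
    words = pattern.lower().split()
    if not words:
        return []
    # Stage 1: intersect the per-word file sets taken from the index.
    candidates = set(index.get(words[0], set()))
    for word in words[1:]:
        candidates &= index.get(word, set())
    # Stage 2: scan only the candidates for the combined pattern.
    return sorted(name for name in candidates if pattern in files[name])


# Toy data, invented for this example.
files = {
    "a.txt": "notes on linear programming",
    "b.txt": "programming a linear filter",
    "c.txt": "linear algebra",
}
index = {}
for name, text in files.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(name)

print(two_stage_search("linear programming", index, files))  # ['a.txt']
```

Here b.txt survives stage 1 because it contains both words, and is
rejected only by the direct scan in stage 2. When both words are common,
many files reach stage 2 in this way, which is the slow case.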
As was mentioned in the section on PATTERNS above, some
characters serve as meta characters for glimpse and need to
be preceded by '\' to search for them. The most common
examples are the characters '.' (which stands for a wild
card), and '*' (the Kleene closure). So, "glimpse ab.de"
will match abcde, but "glimpse ab\.de" will not, and
"glimpse ab*de" will not match ab*de, but "glimpse ab\*de"
will. The meta character - is translated automatically to a
hyphen unless it appears between [] (in which case it denotes
a range of characters).
The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts all
patterns to lower case, finds the appropriate files, and
then searches the actual files using the original patterns.
So, for example, glimpse ABCXYZ will first find all files
containing abcxyz in any combination of lower and upper
cases, and then search these files directly, so only the
right cases will be found. One problem with this approach
is discovering misspellings that are caused by wrong cases.
For example, glimpse -B abcXYZ will first search the index
for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are
matches with no errors, and will go to those files to search
them directly, this time with the original upper cases. If
the closest match is, say AbcXYZ, glimpse may miss it,
because it doesn't expect an error. Another problem is
speed. If you search for "ATT", it will look at the index
for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for
example, "Seattle" which has "att" in it.
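The "ATT" versus "Seattle" effect can be reproduced with a toy index.
This is a minimal Python sketch and not glimpse's code: the dictionary
index, the file names, and the function candidate_files are all
hypothetical, and the whole_word parameter only imitates the effect of
the -w option on the index lookup.

```python
def candidate_files(pattern, index, whole_word=False):
    """Return sorted names of files that the index cannot rule out.

    index: dict mapping a lower-case word to a set of file names.
    """
    key = pattern.lower()  # the index holds lower-case words only
    hits = set()
    for word, names in index.items():
        # Without whole_word, any indexed word containing the key counts.
        if word == key or (not whole_word and key in word):
            hits |= names
    return sorted(hits)


# Toy index, invented for this example.
index = {"att": {"telecom.txt"}, "seattle": {"travel.txt"}}

print(candidate_files("ATT", index))                   # ['telecom.txt', 'travel.txt']
print(candidate_files("ATT", index, whole_word=True))  # ['telecom.txt']
```

Every file returned here would still have to be searched directly with
the original, case-sensitive pattern, so a larger candidate set means a
slower query.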
There is no size limit for simple patterns and simple
patterns within Boolean expressions. More complicated
patterns, such as regular expressions, are currently limited
to approximately 30 characters. Lines are limited to 1024
characters. Records are limited to 48K, and may be
truncated if they are larger than that. The limit of record
length can be changed by modifying the parameter Max_record
in agrep.h.
Glimpseindex does not index words of size > 64.
BUGS
A Boolean AND query that includes two patterns one of which
is a prefix of the other (or equal to the other) may not
work correctly. Essentially glimpse will find the smallest
pattern first, but will not backtrack to try to check again
if it matches another pattern. (We are not sure whether
this is a bug or a feature, because there is no apparent
reason to have patterns like that.)
A Boolean query with a pattern of length 1 (i.e., one
character only) may miss matches.
In some rare cases, regular expressions using * or # may not
match correctly.
A query that contains no alphanumeric characters is not
recommended (unless glimpse is used as agrep and the file
names are provided). This is an understatement.
Please send bug reports or comments to
glimpse@cs.arizona.edu.
DIAGNOSTICS
Exit status is 0 if any matches are found, 1 if none, 2 for
syntax errors or inaccessible files.
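A script that calls glimpse can branch on these exit statuses. The
following is a minimal Python sketch; the names describe_exit_status and
run_glimpse are hypothetical, and run_glimpse assumes that a glimpse
executable is on the PATH and that an index has already been built.

```python
import subprocess

# Meanings taken from the DIAGNOSTICS text above.
EXIT_MEANINGS = {
    0: "matches found",
    1: "no matches",
    2: "syntax error or inaccessible file",
}


def describe_exit_status(code):
    """Translate a glimpse exit status into a short description."""
    return EXIT_MEANINGS.get(code, "unexpected exit status %d" % code)


def run_glimpse(pattern):
    """Run 'glimpse pattern' and return (description, stdout).

    Assumes a glimpse executable is on PATH and an index already exists.
    """
    result = subprocess.run(
        ["glimpse", pattern], capture_output=True, text=True
    )
    return describe_exit_status(result.returncode), result.stdout
```

Passing the pattern as a list element avoids shell quoting, so meta
characters in the pattern reach glimpse unchanged.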
AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science,
University of Arizona, and Sun Wu, the National Chung-Cheng
University, Taiwan. (Email: glimpse@cs.arizona.edu)